Two-pass strategy for handling OOVs in a large vocabulary recognition task

Authors

  • Odette Scharenborg
  • Stephanie Seneff
Abstract

This paper addresses the issue of large-vocabulary recognition in a specific word class. We propose a two-pass strategy in which only major cities are explicitly represented in the first-stage lexicon. An unknown word model encoded as a phone loop is used to detect out-of-vocabulary (OOV) city names (referred to as rare city names). SpeM, a tool that can extract words and word-initial cohorts from phone graphs on the basis of a large fallback lexicon, then provides an N-best list of promising city names on the basis of the phone sequences generated in the first stage. This N-best list is then inserted into the second-stage lexicon for a subsequent recognition pass. Experiments were conducted on a set of spontaneous telephone-quality utterances, each containing one rare city name. We tested the size of the N-best list and three types of language models (LMs). The experiments showed that SpeM was able to include nearly 85% of the correct city names in an N-best list of 3000 city names when a unigram LM, which also boosted the unigram scores of a city name in a given ...
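The hand-off between the two passes can be illustrated with a minimal Python sketch. It is not SpeM and not the authors' implementation: a plain Levenshtein distance over phone symbols stands in for the phone-graph search, and every function name, city entry, and phone transcription below is a hypothetical placeholder. It only shows the idea of ranking fallback-lexicon entries against the phone sequence from the first pass and adding the N best to the second-stage lexicon.

```python
from typing import Dict, List, Tuple

Phones = Tuple[str, ...]


def phone_edit_distance(a: Phones, b: Phones) -> int:
    """Levenshtein distance over phone symbols with unit costs.

    Assumption: a stand-in for whatever scoring SpeM actually uses.
    """
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def n_best_cities(detected: Phones, fallback: Dict[str, Phones], n: int) -> List[str]:
    """Rank fallback-lexicon entries by distance to the detected phones.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(
        fallback,
        key=lambda name: (phone_edit_distance(detected, fallback[name]), name),
    )
    return ranked[:n]


def second_stage_lexicon(first_stage: Dict[str, Phones],
                         fallback: Dict[str, Phones],
                         detected: Phones,
                         n: int) -> Dict[str, Phones]:
    """Copy the first-stage lexicon and add the N-best candidate cities."""
    lexicon = dict(first_stage)
    for name in n_best_cities(detected, fallback, n):
        lexicon[name] = fallback[name]
    return lexicon


# Hypothetical toy data: one major city in the first-stage lexicon,
# a handful of rare cities in the fallback lexicon.
FIRST_STAGE = {"boston": ("b", "ao", "s", "t", "ax", "n")}
FALLBACK = {
    "yarmouth": ("y", "aa", "r", "m", "ax", "th"),
    "plymouth": ("p", "l", "ih", "m", "ax", "th"),
    "falmouth": ("f", "ae", "l", "m", "ax", "th"),
    "amherst": ("ae", "m", "er", "s", "t"),
}

# Imperfect phone string that the phone-loop OOV model might emit for "yarmouth".
DETECTED = ("y", "aa", "m", "ax", "th")

if __name__ == "__main__":
    print(n_best_cities(DETECTED, FALLBACK, 2))
    print(sorted(second_stage_lexicon(FIRST_STAGE, FALLBACK, DETECTED, 2)))
```

In this toy setting the correct city ranks first even though the detected phone string is missing a phone. The point of the N-best list is the same at scale: the second pass only needs the correct name to be somewhere in the list.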

Similar resources

OOV Sensitive Named-Entity Recognition in Speech

Named Entity Recognition (NER), an information extraction task, is typically applied to spoken documents by cascading a large vocabulary continuous speech recognizer (LVCSR) and a named entity tagger. Recognizing named entities in automatically decoded speech is difficult since LVCSR errors can confuse the tagger. This is especially true of out-of-vocabulary (OOV) words, which are often named e...

Combining Semantic Word Classes and Sub-Word Unit Speech Recognition for Robust OOV Detection

Out-of-vocabulary words (OOVs) are often the main reason for the failure of tasks like automated voice searches or human-machine dialogs. This is especially true if rare but task-relevant content words, e.g. person or location names, are not in the recognizer’s vocabulary. Since applications like spoken dialog systems use the result of the speech recognizer to extract a semantic representation o...

Improving out-of-vocabulary name resolution

This paper presents algorithms for generating targeted name lists for candidate out-of-vocabulary (OOV) words for applications in language processing, particularly speech recognition. Focusing on names, which are shown to be the dominant class of OOVs in news broadcasts, the approach involves offline generation of a large name list and online pruning based on a phonetic distance. The resulting ...

Handling Technical OOVs in SMT

We present a project on machine translation of software help desk tickets, a highly technical text domain. The main source of translation errors was out-of-vocabulary tokens (OOVs), most of which were either in-domain German compounds or technical token sequences that must be preserved verbatim in the output. We describe our efforts on compound splitting and treatment of non-translatable token...

Publication date: 2005